ABSTRACT


With the rapid emergence of K-12 online learning platforms, 
a new era of education has been opened up. It is crucial to 
have a dropout warning framework to preemptively iden- 
tify K-12 students who are at risk of dropping out of the 
online courses. Prior researchers have focused on predict- 
ing dropout in Massive Open Online Courses (MOOCs), 
which often deliver higher education, i.e., graduate level 
courses at top institutions. However, few studies have fo- 
cused on developing a machine learning approach for stu- 
dents in K-12 online courses. In this paper, we develop 
a machine learning framework to conduct accurate at-risk 
student identification specialized in K-12 multimodal online 
environments. Our approach considers both online and of- 
fline factors around K-12 students and aims at solving the 
challenges of (1) multiple modalities, i.e., K-12 online environments
involve interactions from different modalities such
as video, voice, etc.; (2) length variability, i.e., students with
different lengths of learning history; (3) time sensitivity, i.e.,
the dropout likelihood changes over time; and (4) data
imbalance, i.e., fewer than 20% of K-12 students
choose to drop out of the class. We conduct a wide range of offline
and online experiments to demonstrate the effectiveness
of our approach. In our offline experiments, we show that 
our method improves the dropout prediction performance 
when compared to state-of-the-art baselines on a real-world 
educational dataset. In our online experiments, we test our 
approach on a third-party K-12 online tutoring platform for 
two months and the results show that more than 70% of 
dropout students are detected by the system. 


1. INTRODUCTION


With the recent development of technologies such as digital
video processing and live streaming, there has been a
steady increase in the number of K-12 students studying
online courses worldwide. Online classes have become necessary
complements to public school education in both developing
and developed countries [31, 27, 37, 42, 35, 34,
36]. Unlike public schools, which focus on teaching
in traditional brick-and-mortar classrooms with 20 to 50
students, online classes open up a new era of education by
incorporating more personalized and interactive experiences
[20, 33, 45, 9, 57].


In spite of the advantages of this new learning opportunity,
a large group of online K-12 students fail to finish their course
programs with little supervision from either their parents or
teachers. Students may drop out of a class for many
reasons, such as a lack of interest or confidence, mismatches
between course content and students’ learning paths, or even
the absence of immediate grade improvements from their parents’
perspectives [37, 39, 25]. Therefore, it is crucial to build an
early dropout warning system to identify such at-risk online
K-12 students and provide timely interventions.


A wide spectrum of approaches has been developed and
successfully applied to predicting dropout in Massive Open
Online Courses (MOOCs) [29, 47, 43, 58, 3, 5]. However,
identifying dropout among K-12 students in online courses is
significantly different from MOOC-based attrition prediction.
The main differences are summarized as follows:


• Watching vs. interaction: Even though both kinds of
learning are conducted in an online environment, learners’
engagement on MOOCs and on K-12 online platforms
differs considerably [20]. In MOOCs, learners mainly watch
pre-recorded video clips and discuss questions and
assignments with teaching assistants on the MOOC forums
[18]. In K-12 online courses, by contrast, students frequently
interact with the online tutors in a multimodal
and immersive learning environment. The tutors may
answer students’ questions, summarize the knowledge
points, take notes for students, etc.


• Spontaneous action vs. paid service: Learners
on existing popular MOOC platforms such as Coursera
and edX are adults who aim to continue
their lifelong learning in higher education and obtain
professional certificates such as Coursera’s Specializations
and edX’s MicroMasters. MOOC learners
are typically self-motivated and self-driven. In
contrast, most available K-12 online education options
are commercial services. Students pay
to enroll in online tutoring programs to strengthen their
in-class knowledge and improve their grades in
final exams. As a result, K-12 online learning involves numerous
out-of-class activities, such as
follow-ups from personal instructors, satisfaction surveys,
and communications with students’ parents.
These out-of-class activities rarely appear in MOOC-based
learning.


• High vs. low dropout rate: The dropout rate for
MOOC-based programs is often as high as 70% to 93%
[31, 52], while the dropout rate in K-12 online courses
is below 20%.


Therefore, it is important to study approaches to identify at- 
risk K-12 online students and build an effective yet practical 
warning system. However, this task is rather challenging due 
to the following real-world characteristics: 


• Multiple modalities: K-12 online learning is conducted
in an immersive and multimodal environment.
Students and instructors interact with each other visually
and vocally. Many multimodal factors
may influence the final dropout decision, ranging
from the quality of interaction between students and
teachers to the teaching speed, volume, and emotions of the
online tutors.


• Length variability: Students join and leave the online
platforms independently, which results in a collection
of observation sequences with different lengths. A
dropout prediction system should be able to (1) make
predictions for students with learning histories of various
lengths; and (2) handle newly enrolled students.


• Data imbalance: The overall dropout rate for K-12
online classes is usually below 20%, which makes the
training samples particularly imbalanced.


The objective of this work is to study and develop models 
that can be used for accurately identifying at-risk K-12 stu- 
dents in multimodal online environments. More specifically, 
we are interested in developing models and methods that 
can predict risk scores (dropout probabilities) given the his- 
tory of past observations of students. We develop a data 
augmentation technique to alleviate class imbalance issues 
when considering the multi-step ahead prediction tasks. We 
conduct extensive sets of experiments to examine every com- 
ponent of our approach to fully evaluate the dropout pre- 
diction performance. 


Overall this paper makes the following contributions: 


• We design various types of features to fully capture
both in-class multimodal interactions and out-of-class
activities. We create a data augmentation strategy to
simulate the time-sensitive changes of dropout likelihood
in real scenarios and alleviate the data imbalance
problem.


• We design a comprehensive set of experiments to understand
the prediction accuracy and the performance impact
of different components and settings from both qualitative
and quantitative perspectives using a real-world
educational dataset.


• We push our approach into a real production environment
to demonstrate the effectiveness of our proposed
dropout early warning system.


The remainder of the paper is organized as follows: Section
2 discusses related research on dropout prediction
in both public school settings and MOOC scenarios and
compares it with our work. In Section 3,
we introduce the assumptions made when building a practical at-risk
student identification system and formulate the prediction
task. In Section 4, we describe the details of our prediction
framework, which include (1) extracting various types
of features from both online classroom recordings and offline
activity logs (see Section 4.1); and (2) a data augmentation
technique that helps us create sufficient training pairs and
overcome the class imbalance problem (see Section 4.2). In
Section 5, we (1) quantitatively show that our model supports
better dropout predictions than alternative approaches
on an educational dataset derived from a third-party K-12 online
learning platform and (2) demonstrate the effectiveness
of our proposed approach in a real production environment.
We summarize our work and outline potential future
extensions in Section 6.


2. RELATED WORK 


Dropout prediction and at-risk student identification have 
been gaining popularity in both the educational research 
and the AI communities. Understanding the reasons behind 
dropouts and building early warning systems have attracted 
a growing interest of academics in the learning analytics 
area. Broadly speaking, existing research regarding dropout 
prediction can be categorized by learning scenarios and di- 
vided into two categories: (1) public school dropout (See 
Section 2.1); and (2) MOOCs dropout (See Section 2.2). 


2.1 Public School Dropout


Educational institutions are faced with the challenges of low
student retention rates and high numbers of dropouts [45,
32]. For example, in the United States, almost one-third
of public high school students fail to graduate from high 
school each year [40, 7] and over 41% of undergraduate stu- 
dents at four-year institutions failed to graduate within six 
years in Fall 2009 [38]. Hence, research work has focused on 
predicting the dropout problem and developing dropout pre- 
vention strategies [40, 41, 8, 55, 13, 30, 10, 49]. Zhang and 
Rangwala develop an at-risk student identification approach 
based on iterative logistic regression that utilizes all the in- 
formation from historical data from previous cohorts [59]. 
The state of Wisconsin creates a predictive dropout early 
warning system for students in grades six through nine and 
provides predictions on the likelihood of graduation for over 
225,000 students [30]. The system utilizes ensemble learning
and is built on the steps of searching through candidate
models, selecting some subsets of best models, and averag- 
ing those models into a single predictive model. Lee and 
Chung address the class imbalance issue using the synthetic 
minority over-sampling techniques on 165,715 high school 
students from the National Education Information System 
in South Korea [33]. Ameri et al. consider different groups 
of variables such as family background, financial, college en- 
rollment and semester-wise credits and develop a survival 
analysis framework for early prediction of student dropout 
using Cox proportional hazards model [1]. 


2.2 MOOCs Dropout 


With the recent boom in educational technologies and resources
in both industry and academia, MOOCs have rapidly
moved into a place of prominence in the mind of the public
and have attracted a lot of research attention from many
communities in different domains. Among all MOOC-related
research questions, the dropout prediction problem has emerged
because of the surprisingly high attrition rate [54, 19, 26, 23, 44,
56, 6, 11, 12, 20]. Ramesh et al. treat students’ engagement
types as latent variables and use probabilistic soft logic to 
model the complex interactions of students’ behavioral, lin- 
guistic and social cues [43]. Sharkey et al. conduct a series 
of experiments to analyze the effects of different types of 
features and choices of prediction models [47]. Kim et al. 
study the in-video dropouts and interaction peaks, which 
can be explained by five identified student activity patterns 
[25]. He et al. propose two transfer learning based logis- 
tic regression algorithms to balance the prediction accuracy 
and inter-week smoothness [21]. Tang et al. formulate the 
dropout prediction as a time series forecasting problem and 
use a recurrent neural network with long short-term mem- 
ory cells to model the sequential information among features 
[50]. Both Yang et al. and Mendez et al. conduct survival
analysis to investigate the social and behavioral factors
that affect dropout over the course of participation
in MOOCs [58, 39]. Comprehensive literature surveys on MOOC-based
dropout prediction can be found in
[51, 5].


In this work, we focus on identifying at-risk students in K-12
online classes, which differs significantly from
dropout prediction in either public school or MOOC-based
scenarios. In the K-12 multimodal learning environment, the
learning paradigm focuses on interaction instead of watching.
The interactions come from different modalities, which
rarely happens in traditional public schools and MOOC-based
programs of higher education. Furthermore, as a paid service,
K-12 online learning involves both in-class and out-of-class
activities, and both of them contain multiple factors
that could lead to class dropouts. These differences make
existing research inapplicable to K-12 online learning
scenarios. To the best of our knowledge, this is the
first research that comprehensively studies the dropout prediction
problem in K-12 online learning environments from
real-world perspectives.


3. PROBLEM FORMULATION 

3.1 Assumptions 

In order to characterize K-12 online learning scenarios,
we need to carefully consider every case in the real-world
environment and make reasonable assumptions. Without
loss of generality, we make the following assumptions in the
rest of the paper.


ASSUMPTION 1 (RECENCY EFFECT). Time spans between 
the date of dropout and the date of last online courses vary 
a lot. Students may choose to drop the class right after one 
course or quit after two weeks of no course. Therefore, the 
per-day likelihood of dropout should be time-aware and the 
closer to the dropout date, the more accurate the dropout 
prediction should be. 


ASSUMPTION 2 (MULTI-STEP AHEAD FORECAST). The
real-world dropout prediction framework should be able to
flexibly support multi-step ahead predictions, i.e., the next-day
and next-week probabilities of dropout.


3.2 The Prediction Problem


In this work, our objective is to predict the value of the future
status of the target student given his or her past learning
history, i.e., observations collected from K-12 online platforms.
More specifically, let $S$ be the collection of all students
and for each student $s \in S$, we assume that we
have observed a sequence of $n_s$ past observation-time pairs
$\{\langle x_j^s, t_j^s \rangle\}_{j=1}^{n_s}$, $x_j^s \in X^s$, and $t_j^s \in T^s$, such that $0 < t_j^s < t_{j+1}^s$,
and $x_j^s$ is the observation vector made at time $t_j^s$
for student $s$. $X^s$ and $T^s$ represent the collections of observations
and timestamps for student $s$. Correspondingly,
let $Y^s$ be the collection of indicators of status (dropout, ongoing
or completion) of student $s$ at each timestamp, i.e.,
$Y^s = \{y_j^s\}_{j=1}^{n_s}$. Let $\Delta$ be the future time span in multi-step
ahead prediction. Time $t_{n_s+\Delta}^s$ ($\Delta > 0$) is the time at which
we would like to predict the student’s future status $y_{n_s+\Delta}^s$.


Please note that we omit the explicit student index s in the
following sections for notational brevity and that our approach
can be generalized to large samples of student data without
modifications.
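
The formulation above can be made concrete with a small container for one student's observation-time pairs. This is only a minimal sketch: the class name `StudentHistory`, the field names, and the reading of the prediction time as "last observed timestamp plus the future time span" are illustrative assumptions, not part of the original formulation.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Status values named in the problem formulation.
STATUSES = ("ongoing", "dropout", "completion")


@dataclass
class StudentHistory:
    """Observation-time pairs for one student (hypothetical container)."""

    observations: List[Sequence[float]]  # x_1, ..., x_n
    timestamps: List[float]              # 0 < t_1 < t_2 < ... < t_n
    statuses: List[str]                  # y_1, ..., y_n

    def __post_init__(self) -> None:
        n = len(self.timestamps)
        if not (len(self.observations) == n == len(self.statuses)):
            raise ValueError("observations, timestamps and statuses must align")
        if any(t <= 0 for t in self.timestamps):
            raise ValueError("timestamps must be positive")
        if any(a >= b for a, b in zip(self.timestamps, self.timestamps[1:])):
            raise ValueError("timestamps must be strictly increasing")
        if any(y not in STATUSES for y in self.statuses):
            raise ValueError("unknown status value")

    def target_time(self, delta: float) -> float:
        """Prediction time: last observed timestamp plus the span delta (assumed reading)."""
        if delta <= 0:
            raise ValueError("delta must be positive")
        return self.timestamps[-1] + delta


history = StudentHistory(
    observations=[[0.1], [0.2]],
    timestamps=[1.0, 3.0],
    statuses=["ongoing", "ongoing"],
)
next_week = history.target_time(7)
```

Validating the ordering constraint at construction time keeps later feature code free of defensive checks.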


4. THE PREDICTION FRAMEWORK 


The dropout prediction for K-12 online courses is a time-variant
task. A student who has just had a class should have a
smaller dropout probability than a student who has not
taken any class for two weeks. Therefore, when designing a
practical approach to dropout prediction, this recency
effect, i.e., Assumption 1, has to be considered. In this work,
we extract both static and time-variant features from different
categories to comprehensively capture the factors leading to dropout
events (see Section 4.1). Furthermore, we
create a label augmentation technique that not only alleviates
the class imbalance problem when building a predictive
framework for K-12 online classes, but also incorporates the recency
effect into label construction (see Section 4.2). The
learning of our dropout model is discussed in Section 4.3 and
the overall learning procedure is summarized in Section 4.4.


4.1 Features 

In this section, we develop a distinct set of features
for at-risk student identification in real-world K-12
online learning scenarios, which can be divided into three
categories: (1) in-class features that focus on K-12 students’
online behaviors during the class (see Section 4.1.1); (2) out-of-class
features that consider as many real-world factors
occurring outside the class as possible, which may influence the
dropout decisions (see Section 4.1.2); and (3) time-variant
features that include both the historical performance of teachers
and aggregated features of student activities within fixed-size
windows (see Section 4.1.3).


4.1.1  In-class Features 

Unlike adults who continue their learning in higher
education on MOOC-based platforms, K-12 students come
for grade improvements. This intrinsic difference in their
learning goals leads to contrasting learning behaviors. Adult
learners on MOOCs study independently through various activities,
such as viewing lecture videos, posting questions on
MOOC forums, etc. This results in various types of in-class
click-stream data, which have been shown to be effective for
dropout prediction in many existing studies [15, 51,
54, 6, 12, 11, 20]. However, such click-based activities barely
happen in K-12 online scenarios. Instead, there are frequent
voice-based interactions between K-12 students and their
teachers. The teachers not only make every effort to clarify
questions that students bring unresolved from their public
schools, but are also responsible for arousing students’ learning
interest and building their study habits. Therefore, we
focus on extracting in-class multimodal features specific
to K-12 tutoring scenarios from the online classroom videos.
We categorize our features as follows. Table 1 lists
some examples of in-class features from different categories.


• Prosodic features: speech-related features such as
signal energy, loudness, Mel-frequency cepstral coefficients
(MFCC), etc.


• Linguistic features: language-related features such
as statistics of part-of-speech tags, the number of interregnum
words, the distribution of sentence lengths,
the voice speed of each sentence, etc.


• Interaction features: features such as the number
of teacher-student Q&A rounds, the number of times
teachers remind students to take notes, etc.


To extract all the features listed in Table 1, we first extract
audio tracks from classroom recordings on both the teacher’s
and the student’s sides. Then we extract acoustic features directly
from the classroom audio tracks by using the widely
used open-source tool openSMILE. We obtain classroom
transcriptions by passing the audio files to a self-trained
automatic speech recognition (ASR) module. After that,
we extract both linguistic and interaction features from the
conversational transcripts. Finally, we concatenate all features
from the above categories and apply a linear PCA to get
the final dense in-class features. The entire in-class feature
extraction workflow of our approach is illustrated in Figure
1. 
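
The last step of this workflow (concatenating the per-category features and applying a linear PCA) can be sketched as follows. This is an illustrative sketch only: the function name `build_in_class_features`, the feature dimensions, the number of components, and the random stand-in data are all assumptions; the paper does not state these values.

```python
import numpy as np
from sklearn.decomposition import PCA


def build_in_class_features(prosodic, linguistic, interaction, n_components=8):
    """Concatenate per-class feature blocks and project them with linear PCA.

    Each input is an array of shape (n_classes, n_features_of_that_block).
    n_components is an assumed value chosen only for illustration.
    """
    raw = np.hstack([prosodic, linguistic, interaction])
    pca = PCA(n_components=n_components)
    dense = pca.fit_transform(raw)
    return dense, pca


# Random stand-ins for real prosodic, linguistic and interaction features.
rng = np.random.default_rng(0)
prosodic = rng.normal(size=(50, 20))
linguistic = rng.normal(size=(50, 10))
interaction = rng.normal(size=(50, 5))

dense, pca = build_in_class_features(prosodic, linguistic, interaction)
```

In practice the fitted `pca` object would be kept and reused to transform features of new classes, so that training and serving share the same projection.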


Please note that, thanks to online streaming,
students’ and teachers’ videos are recorded separately and
hence there is no voice overlap in the video recordings. This
avoids the unsolved challenge of speaker diarization [2]. Similar
to Blanchard et al. [4], we find that publicly available
ASR services may yield inferior performance in noisy and
dynamic classroom environments. Therefore, we train our
own ASR models on a classroom-specific dataset based on
a deep feed-forward sequential memory network proposed
by Zhang et al. [60]. Our ASR has a word error rate of
28.08% in our classroom settings.


Figure 1: The workflow of our in-class feature extraction.
ASR is short for automatic speech recognition.
[Diagram labels: Classroom Recording; Audio; Linguistic; Interaction; In-class Features.]


4.1.2 Out-of-class Features 

As we discussed in Section 1, personalized K-12 online tutoring
is a paid service in most countries. Besides the course
quality itself, there are multiple other factors in such a service
industry that may lead customers to drop
the class. Therefore, out-of-class features, which are typically
ignored in the previous literature, play an extremely
important role in identifying at-risk students in real-world
K-12 online scenarios. In this work, we collect and summarize all
the available out-of-class features and divide them into the
following two categories. Illustrative examples are listed
in Table 1.


• Pre-class features: Pre-class features capture the
behaviors of students (or even their parents) before taking
the class, such as purchasing behaviors, promotion
negotiations, etc. Examples: the number of rounds
of conversation and negotiation before the class, the size of the
discount the student received, etc.


• Post-class features: Post-class features model the
offline activities in such paid K-12 online services. For
example, students and their parents receive follow-ups
based on their previous class performance and give
their satisfaction feedback. Another example is that
students may request changes to their course schedules.


4.1.3 Time-variant Features 

Besides in-class and out-of-class features, we manually design
time-variant features to model the changes in the likelihood
of students’ dropout intentions. Cases such as a student who has just
had a class and a student who had a class two weeks
ago should be explicitly distinguished when constructing features.
Therefore, we create time-variant features by using
a lookback window approach on students’ observation
sequences. More specifically, for a given timestamp, we only
focus on the previously observed activities of each student within
a period of time. The length of the lookback windows varies
from 1 to 30 days. Sufficient statistics are extracted as time-variant
features from each lookback window. Meanwhile, we
compute historical performance features to reflect the teaching
experience and performance of each individual teacher.
Table 1 shows some examples of the time-variant features we use
in our dropout prediction framework.


• Lookback window features: The lookback window
features aggregate important statistics from students’
observations within a fixed-length lookback window,
such as the numbers of courses taken in the past one, two,
and three weeks.


• Historical performance features: The historical
features aggregate each teacher’s past teaching performance,
which represents the overall teaching quality
profile. They include the total numbers of courses and
students taught, historical dropout rates, etc.


Table 1: List of examples of in-class, out-of-class, and time-variant features.

Category | Type | Examples
In-class | Prosodic | the average signal energies of student and teacher
In-class | Prosodic | the average loudness of student and teacher
In-class | Prosodic | the Mel-frequency cepstral coefficients of audio tracks from student and teacher
In-class | Prosodic | the zero-crossing rates of student and teacher
In-class | Linguistic | # of sentences per class of student and teacher
In-class | Linguistic | # of pause words per class of student and teacher
In-class | Linguistic | average lengths of sentences per class of student and teacher
In-class | Linguistic | voice speeds (char per second) of student and teacher
In-class | Interaction | # of teacher-student Q&A rounds
In-class | Interaction | # of times the teacher reminds the student to take notes and summarization
In-class | Interaction | # of times the teacher asks the student to repeat
In-class | Interaction | # of times the teacher clarifies the student’s questions
Out-of-class | Pre-class | # of days since the student places the online course order
Out-of-class | Pre-class | # of courses in the student’s order
Out-of-class | Pre-class | # of conversations between the sales staff and the students (or their parents)
Out-of-class | Pre-class | the discount ratio of the student’s order
Out-of-class | Post-class | # of follow-ups after the student took the first class
Out-of-class | Post-class | # of words in the latest follow-up report
Out-of-class | Post-class | # of times the student reschedules the class
Out-of-class | Post-class | the follow-up ratio, i.e., # of follow-ups divided by # of taken courses
Time-variant | Historical performance | # of courses taught by each individual teacher in total
Time-variant | Historical performance | # of courses the student had in total
Time-variant | Historical performance | historical dropout rates
Time-variant | Historical performance | historical average time span between classes
Time-variant | Lookback window | # of courses taken in past one/two/three weeks
Time-variant | Lookback window | # of courses the student scheduled in past one/two/three weeks
Time-variant | Lookback window | # of positive/negative follow-up reports in past one/two/three weeks
Time-variant | Lookback window | the average time span of classes taken in past month


4.2 Data Augmentation

According to Assumptions 1 and 2 and the problem formulation
in Section 3.2, a real-world early warning system is
supposed to flexibly support multi-step ahead predictions for
each student, i.e., given any future time span $\Delta$, the system
computes the probability of the student’s status $y_{n_s+\Delta}^s$. The
predicted probability should be able to dynamically adapt
when the value of $\Delta$ changes. The multi-step ahead assumption
essentially requires the approach to make predictions
at the finer granularity of a <student, timestamp>
pair, i.e., $\langle s, t_{n_s+\Delta}^s \rangle$, instead of at the student level, i.e.,
$s$. This poses a challenging question: given that
only about 20% of K-12 students drop their online classes,
how do we tackle the class imbalance problem when extracting
<student, timestamp> training pairs from a collection
of multimodal observation sequences (ending in either completion or
dropout) in K-12 online scenarios?


Let $S_1$ and $S_2$ be the sets of student indices of dropout and
non-dropout students, i.e., $S_1 = \{i \mid y_{n_i}^i = \text{dropout}\}$ and
$S_2 = \{i \mid y_{n_i}^i = \text{completion}\}$. Let $P$ and $N$ be the sets
of positive (dropout) and negative (non-dropout) <student,
timestamp> pairs. By definition, $P$ and $N$ are constructed
as follows:

$$
\begin{aligned}
P &= \{\langle x_{n_i}^i, t_{n_i}^i \rangle \mid i \in S_1\} \\
N &= \{\langle x_k^i, t_k^i \rangle \mid i \in S_1,\ k \in T^i \setminus t_{n_i}^i\} \\
  &\quad \cup \{\langle x_k^j, t_k^j \rangle \mid j \in S_2,\ k \in T^j\}
\end{aligned}
\tag{1}
$$
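
The construction in Eq. (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `build_pairs` and the input layout (a dict from student id to a list of timestamps plus a final status) are hypothetical, and pairs are represented by `(student_id, timestamp)` only, with the feature vectors left out.

```python
def build_pairs(students):
    """Split <student, timestamp> pairs into positives and negatives.

    students maps a student id to (timestamps, final_status), where
    timestamps are sorted and final_status is the status at the last one.
    """
    positives, negatives = [], []
    for sid, (timestamps, final_status) in students.items():
        if final_status == "dropout":
            # Only the final pair of a dropout student is positive.
            positives.append((sid, timestamps[-1]))
            negatives.extend((sid, t) for t in timestamps[:-1])
        else:
            # Every pair of a non-dropout student is negative.
            negatives.extend((sid, t) for t in timestamps)
    return positives, negatives


students = {
    "a": ([1, 4, 9], "dropout"),
    "b": ([2, 3], "completion"),
}
P, N = build_pairs(students)
```

Even in this toy case only one of five pairs is positive, which is the imbalance the augmentation step is meant to reduce.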


As in many research problems such as fraud detection [53], the
sizes of $P$ and $N$ are typically very imbalanced. While in
some cases the class imbalance problem may be alleviated by
applying an over-sampling algorithm to the minority class
sample set, the diversity of the available instances is often
limited. Therefore, in this work, we propose a time-aware
data augmentation technique that artificially generates
pseudo positive (dropout) <student, timestamp> pairs.


More specifically, for each dropout student $i$ in $S_1$, we set a
lookback window with length $\Lambda$, where $\Lambda < t_{n_i}^i - t_1^i$. For
each timestamp $\tilde{t}_j^i$ in the lookback window such that

$$
\max(t_{n_i-1}^i,\ t_{n_i}^i - \Lambda) < \tilde{t}_j^i < t_{n_i}^i, \tag{2}
$$

we generate its corresponding pseudo positive training pair
$\langle \tilde{x}_j^i, \tilde{t}_j^i \rangle$ as follows: $\tilde{x}_j^i = F(X^i, T^i)$, where $F(\cdot,\cdot)$ is the
generation function. The choices of $F(\cdot,\cdot)$ are flexible and
vary among different types of features (see Section 4.1). In
this work, for in-class and out-of-class features, we aggregate
all the available features until $\tilde{t}_j^i$ and re-compute the time-variant
features according to timestamp $\tilde{t}_j^i$. We use $\tilde{P}$ to
represent the collection of all pseudo positive training pairs generated
from dropout students in $S_1$. Figure 2 illustrates how
the pseudo positive training pairs are generated. 
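
The generation of pseudo positive timestamps and the re-computation of a time-variant feature at each of them can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names, the one-day step between pseudo timestamps, and the choice of "number of classes in the lookback window" as the re-computed feature are assumptions made for the example.

```python
def pseudo_timestamps(timestamps, window, step=1.0):
    """Timestamps strictly between max(t_{n-1}, t_n - window) and t_n.

    timestamps are the student's sorted observation times, the last one
    being the dropout time. step (one day here) is an assumed grid.
    """
    t_last = timestamps[-1]
    lower = t_last - window
    if len(timestamps) > 1:
        lower = max(timestamps[-2], lower)
    out = []
    t = t_last - step
    while t > lower:
        out.append(t)
        t -= step
    return sorted(out)


def courses_in_lookback(class_times, now, days):
    """Lookback-window feature: number of classes in the interval (now - days, now]."""
    return sum(1 for c in class_times if now - days < c <= now)


history = [1, 10, 20]          # last entry is the dropout time
augmented = pseudo_timestamps(history, window=7)
# Re-compute the time-variant feature at every pseudo timestamp.
features = [courses_in_lookback(history, now=t, days=7) for t in augmented]
```

Because each pseudo pair re-computes its own lookback statistics, the augmented positives differ from one another instead of being copies of the final pair.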


Figure 2: Graphical illustration of the data augmentation
technique. [Diagram: the original observation timeline and the
augmented timestamps placed within the lookback window; markers
distinguish negative, positive, and augmented positive pairs.]


Besides, we assign a time-aware weight to each pseudo positive
training pair to reflect the recency effect in Assumption
1. For each pseudo pair $\langle \tilde{x}_j^i, \tilde{t}_j^i \rangle$, the corresponding weight
$w_j^i$ is computed by

$$
w_j^i = G\!\left(\frac{t_{n_i}^i - \tilde{t}_j^i}{\Lambda}\right) \tag{3}
$$

where $\Lambda$ is the length of the lookback window and the weighting
function $G(\cdot)$ takes the normalized time
span between each timestamp of pseudo pair and the exact
dropout date as input and outputs a normalized weighting
score to reflect our confidence on the “positiveness” of the
simulated training pairs. The closer to the dropout date, the
larger the confidence weights should be. The choices of $G(\cdot)$
are open to any function that gives response values ranging
from 0 to 1, such as the linear, convex or concave functions
illustrated in Figure 3.


Figure 3: Graphical illustration of different weighting
functions $G(\cdot)$ over the normalized time span from 0 to 1:
linear, $G(x) = 1 - x$; concave, $G(x) = 1 - x^2$; convex, $G(x) = (x - 1)^2$.


The effect of different choices of weighting function is discussed
in Section 5.4. The augmented training set $\tilde{P}$ and
the corresponding time-aware weights are used in the model
training in Section 4.3. 
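
The three weighting functions shown in Figure 3 and the weight of a pseudo positive pair can be written down directly. This is a small sketch: the function names are hypothetical, and normalizing the time span by the lookback window length is an assumption about how the "normalized time span" is obtained.

```python
def g_linear(x):
    return 1.0 - x


def g_concave(x):
    return 1.0 - x ** 2


def g_convex(x):
    return (x - 1.0) ** 2


def pseudo_weight(t_dropout, t_pseudo, window, g=g_convex):
    """Weight of a pseudo positive pair from its distance to the dropout date.

    The time span is normalized by the lookback window length (assumed),
    so the argument passed to g lies in [0, 1].
    """
    x = (t_dropout - t_pseudo) / window
    if not 0.0 <= x <= 1.0:
        raise ValueError("pseudo timestamp must lie inside the lookback window")
    return g(x)


near = pseudo_weight(20, 19, 7, g=g_linear)   # one day before dropout
far = pseudo_weight(20, 15, 7, g=g_linear)    # five days before dropout
```

All three functions decrease on [0, 1], so a pair closer to the dropout date always receives the larger weight; they differ only in how quickly the confidence decays, with the convex choice decaying fastest.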


4.3 Model Learning 


In the learning stage, we combine the original training set
($P$ and $N$) with the augmented set $\tilde{P}$ for model training.
Even though the data augmentation alleviates the class imbalance
problem, i.e., it improves the positive example ratio
from 0.1% to 10%, the imbalance problem still exists.
Therefore, we employ the classical weighted over-sampling
algorithm on positive pairs to further reduce the imbalance
effect. The weights of the original positive examples
in $P$ are set to 1, and the pseudo positive examples’ weights are
computed by $G(\cdot)$ as described in Section 4.2. Since dropout
datasets are usually small compared to other Internet-scale
datasets, we choose to use the Gradient Boosting Decision Tree
(GBDT) [16] as our prediction model. The GBDT exhibits
its robust predictive performance in many well studied prob- 
lems [24, 48]. 
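
The learning stage can be sketched with scikit-learn's `GradientBoostingClassifier`. This is only an illustration under explicit assumptions: "weighted over-sampling" is read here as drawing positive examples with replacement with probability proportional to their weights, the helper name `weighted_oversample` is hypothetical, the target number of positives equals the number of negatives, and the feature vectors are random stand-ins rather than real data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def weighted_oversample(X_pos, w_pos, n_samples, rng):
    """Draw positives with replacement, proportionally to their weights."""
    p = np.asarray(w_pos, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(X_pos), size=n_samples, replace=True, p=p)
    return X_pos[idx]


rng = np.random.default_rng(0)
# Toy stand-ins for real feature vectors (assumed shapes).
X_neg = rng.normal(loc=0.0, size=(200, 5))
X_pos = rng.normal(loc=1.5, size=(20, 5))
# Weight 1 for original positives, values of G(.) for pseudo positives.
w_pos = np.concatenate([np.ones(5), rng.uniform(0.1, 1.0, size=15)])

X_pos_os = weighted_oversample(X_pos, w_pos, n_samples=len(X_neg), rng=rng)
X_train = np.vstack([X_pos_os, X_neg])
y_train = np.concatenate([np.ones(len(X_pos_os)), np.zeros(len(X_neg))])

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
# Risk score: predicted probability of the positive (dropout) class.
risk_scores = model.predict_proba(X_train)[:, 1]
```

An alternative reading would pass the weights through the `sample_weight` argument of `fit` instead of resampling; which variant the authors used is not stated in the text.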


4.4 Summary 
The overall model learning procedure of our K-12 online
dropout prediction is summarized in Algorithm 1.


Algorithm 1 Model learning procedure of the K-12 online
dropout prediction.

INPUT:
• A set of K-12 students $S$ and their corresponding multimodal classroom recordings and activity logs.
• The length of the lookback window $\Lambda$.
• The choice of weighting function $G(\cdot)$.
PROCEDURE:
1: // Feature extraction
2: Extract in-class features from multimodal recordings, see Section 4.1.1.
3: Extract out-of-class features from student activity logs, see Section 4.1.2.
4: Extract time-variant features, see Section 4.1.3.
5: Concatenate the three types of features above.
6: // Label generation and augmentation
7: Create the original positive and negative training pair sets, i.e., $P$ and $N$, see eq. (1).
8: Generate the augmented pseudo positive training set, i.e., $\tilde{P}$, and the corresponding weights, see eq. (3).
9: // Model learning
10: Conduct weighted over-sampling on the union of $P$ and $\tilde{P}$.
11: Train the GBDT model on the over-sampled positive examples and original negative examples.
OUTPUT:
• The GBDT dropout prediction model $\Omega$.


5. EXPERIMENTAL EVALUATION


In this section, we will (1) introduce our dataset, which is
collected from a real-world K-12 online learning platform,
and the details of our experimental settings (Section 5.1);
(2) show that our approach is able to improve the predictive
performance when compared to a wide range of classic
baselines (Section 5.2); (3) evaluate the impacts of different
sizes of lookback windows, different weighting functions in
data augmentation and feature combinations (Section 5.3,
Section 5.4 and Section 5.5); and (4) deploy our model into
the real production system to demonstrate its effectiveness
(Section 5.6).


Footnote (GBDT implementation): https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting


We note that the hyperparameters used in our methods are
selected in all experiments by internal cross validation,
optimizing the models' predictive performance. In the
following experiments, we set the size of the lookback window
to 7; the impact of the window size is discussed in
Section 5.3. We use the convex weighting function when
conducting pseudo positive data augmentation (see Section 5.4).


5.1 Experimental Setting 
5.1.1 Data 


To evaluate the effectiveness of our proposed framework,
we conduct several experiments on a real-world K-12 online
course dataset from a third-party online education platform.
We select 3922 middle school and high school students
registered between August 2018 and February 2019 as our
samples. All the features listed in Section 4.1 are computed
and extracted from daily activity logs on the platform. In
our dataset, 634 students choose to drop the class, so the
dropout rate is 16.16%. The average time span of the students
on the platform is about 86 days, which provides us with
338428 observational <student, timestamp> sample pairs
in total. We randomly select 80% of the students and use their
corresponding <student, timestamp> sample pairs as the training
set, and use the remaining 20% of students' sample pairs for
testing purposes. The data augmentation technique discussed
in Section 4.2 is applied only to the training set.
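
The split described above is made at the student level, so that all sample pairs of one student fall on the same side. A minimal sketch of such a split follows; the function name `split_by_student`, the seed, and the toy data are assumptions made for illustration only.

```python
import random


def split_by_student(pairs, train_fraction=0.8, seed=0):
    """Split <student, timestamp> pairs so that no student appears in both sets."""
    students = sorted({student for student, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(students)
    n_train = int(round(train_fraction * len(students)))
    train_students = set(students[:n_train])
    train = [p for p in pairs if p[0] in train_students]
    test = [p for p in pairs if p[0] not in train_students]
    return train, test


# Hypothetical toy data: 10 students, each observed on 3 days.
pairs = [("s%d" % s, day) for s in range(10) for day in range(3)]
train_pairs, test_pairs = split_by_student(pairs)
```

Splitting by student rather than by sample pair avoids leaking a student's later observations into the training set.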


5.1.2 Multi-step Ahead Prediction Setting


To fully examine the dropout prediction performance, we
evaluate the model's predictions over different multi-step
ahead time spans, i.e., given a current timestamp, we
predict the outcome (dropout or non-dropout) in the next
X days, where X = 1, 2, ..., 14.
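
One plausible reading of this setting is sketched below: a sample pair is labeled positive for horizon X if the student's dropout day falls within the next X days. The function name `label_within_horizon` and the day-index representation are assumptions; the exact pair construction is defined by eq.(1).

```python
def label_within_horizon(current_day, dropout_day, horizon):
    """Return 1 if the student drops out within `horizon` days after current_day.

    dropout_day is None for students who never drop out in the observed data.
    """
    if dropout_day is None:
        return 0
    return int(current_day < dropout_day <= current_day + horizon)


# Hypothetical student who drops out on day 30, observed on day 25.
labels = {x: label_within_horizon(25, 30, x) for x in range(1, 15)}
```

In this toy case the label is 0 for X below 5 and 1 from X = 5 onward, since the dropout happens five days after the observation.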


5.1.3 Evaluation Metric 

Similar to [18, 50, 17, 15, 51, 21], we evaluate and compare
the performance of the different methods by using the
Area Under Curve (AUC) score, which is the area under
the Receiver Operating Characteristic (ROC) curve [14]. An
ROC curve is a graphical plot created by plotting the true
positive rate (TPR) against the false positive rate (FPR)
at varying decision thresholds. In our dropout prediction
scenario, the TPR is the fraction of students who truly drop
out that are predicted as "at-risk". The FPR is the fraction
of students who do not drop out that are falsely predicted as
"dropout". The AUC score is insensitive to the data imbalance
issue and does not require additional parameters, such as a
decision threshold, for model comparison. The AUC score
reaches its best value at 1 and its worst at 0; a random
predictor scores 0.5.
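
To make the metric concrete, the sketch below computes the AUC through its equivalent pairwise interpretation: the probability that a randomly chosen dropout student receives a higher score than a randomly chosen non-dropout student. The function name `auc_score` and the toy predictions are illustrative assumptions, not part of our evaluation code.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions: 2 dropouts among 6 students.
y_true = [1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
auc = auc_score(y_true, y_score)  # 7 of the 8 positive-negative pairs are ordered correctly
```

The quadratic loop is only meant for small illustrations; a rank-based implementation is preferable for datasets of realistic size.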


5.1.4 Baselines 

We compare our proposed approach with the following rep- 
resentative baseline methods: (1) Logistic Regression (LR) 
[28], (2) Decision Tree (DT) [46] and (3) Random Forest 
(RF) [22]. LR, DT and RF are all trained on the same set
of features, defined in Section 4.1, as our proposed method.
The training set is created by using eq.(1).


5.2 The Overall Prediction Performance 
The results of these models are shown in Figure 4, from
which we make the following observations:

Figure 4: The overall prediction performance with
different multi-step ahead time spans in terms of
AUC scores.


• First, our model outperforms all other methods in terms
of AUC scores on different future time spans, which
demonstrates the effectiveness of our approach with
positive data augmentation. By adding more diverse
pseudo positive training pairs with the corresponding
decaying confidence weights, the GBDT model is able
to learn the dropout patterns from multiple factors.

• Second, as we increase the length of the time span in
multi-step ahead prediction, the performance of all
models decreases accordingly. Our approach achieves an
AUC score of 0.8262 in the next-day prediction task,
while the performance drops to 0.7430 in the next
two-week prediction task. We believe this is because of
the false negative mistakes the models make, i.e., the
model predicts that the students will continue but they
drop the class in the next two weeks. This indicates
that, without more information about the students, the
ML models have very limited ability to predict the
long-term outcomes of student status, which also
reflects the fact that many factors could lead to
dropout.

• Third, comparing LR, DT, and RF, we can see that DT
achieves the worst performance. This is because of its
instability: with a small amount of training data, the
DT approach is sensitive to small perturbations in the
data. The RF approach remedies this shortcoming by
replacing a single decision tree with a random forest of
decision trees, and the performance is boosted.
Meanwhile, as a linear model, LR is not powerful
enough to accurately capture the dropout cases.


5.3 Impact of Sizes of Lookback Windows

The number of augmented positive training pairs is directly
determined by the size of the lookback window A. Therefore,
to comprehensively understand the performance of our
proposed approach, we conduct experimental comparisons on
different sizes of lookback windows. We set the window size
to 3, 7, and 14. Meanwhile, we add a baseline with no data
augmentation. The results are shown in Figure 5.
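
The dependence of the number of augmented pairs on the window size can be sketched as follows. We assume here, purely for illustration, that pseudo positives are taken from the days immediately preceding the dropout day and never reach before the student's first observed day; the function name `pseudo_positive_days` is hypothetical, and the actual construction is given by eq.(3).

```python
def pseudo_positive_days(dropout_day, first_day, window):
    """Days within `window` days before the dropout day, not earlier than first_day."""
    start = max(first_day, dropout_day - window)
    return list(range(start, dropout_day))


# Hypothetical student: active since day 0, drops out on day 20.
counts = {w: len(pseudo_positive_days(20, 0, w)) for w in (3, 7, 14)}

# A student with a short history yields fewer pairs than the window allows.
short_history = pseudo_positive_days(5, 0, 14)
```

Under this assumption a larger window adds more pseudo positive pairs per dropout student, but the added pairs lie further away from the true dropout observation.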


From Figure 5, we can see that the best-performing lookback
window size grows with the length of the time span in
multi-step ahead prediction. When conducting short-term
dropout predictions, models trained on data augmented with
a smaller lookback window outperform the others. As we
gradually increase the time span of future predictions, the
further the model looks back, the higher the prediction AUC
score it achieves. Overall, the model trained with a 7-day
lookback window has the best performance across different
multi-step ahead time spans in terms of AUC scores.


5.4 Impact of Different Weighting Functions 
In this section, we examine the performance changes when
varying the form of the weighting function. More specifically,
we compare the prediction results of using the convex
function to the results of the other choices. The results are
shown in Figure 6. As we can see from Figure 6, the convex
option outperforms the other choices by a large margin across
all multi-step ahead time spans. When computing the
over-sampling weights of pseudo training examples, the convex
function gives more weight to the most recent examples,
i.e., examples close to the timestamp of the true dropout
observations. This also confirms the necessity of considering
the recency effect assumption (Assumption 1) when building the
dropout prediction framework.
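
To illustrate why a convex weighting concentrates the sampling mass on recent examples, the sketch below compares three decaying functions. The specific forms (`1 - t`, `(1 - t)**2`, `1 - t**2`) and the normalization of the time distance are hypothetical stand-ins chosen for this example; they are not the definitions of G(·) from Section 4.2.

```python
def linear_weight(t):
    """Linearly decaying weight on the normalized distance t in [0, 1)."""
    return 1.0 - t


def convex_weight(t):
    """Convex decay: drops quickly as the example gets older."""
    return (1.0 - t) ** 2


def concave_weight(t):
    """Concave decay: stays high for a while before dropping."""
    return 1.0 - t ** 2


def normalized_weights(weight_fn, window):
    """Weights for pseudo positives 1..window days before dropout, scaled to sum to 1."""
    raw = [weight_fn((d - 1) / window) for d in range(1, window + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Share of the sampling mass given to the most recent pseudo positive.
shares = {
    name: normalized_weights(fn, window=7)[0]
    for name, fn in [
        ("linear", linear_weight),
        ("convex", convex_weight),
        ("concave", concave_weight),
    ]
}
```

With a 7-day window, the most recent example receives a share of 0.35 under the convex form, 0.25 under the linear form, and about 0.19 under the concave form, which matches the qualitative ordering described above.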


5.5 Impact of Different Features 
Figure 5: Models trained on data augmented with different
sizes of lookback windows, evaluated over different
multi-step ahead time spans in terms of AUC scores. "none"
represents the model trained without any lookback
data augmentation.


In this subsection, we systematically examine the effect of
different types of features by constructing the following
model variants:

• In: only the in-class features are used.

• Out: only the out-of-class features are used.

• Time: only the time-variant features are used.

• In+Time: it eliminates the contribution of Out features
and only uses features from In and Time.

• Out+Time: it eliminates the contribution of In features
and only uses features from Out and Time.

• In+Out: it eliminates the contribution of Time features
and only uses features from In and Out.

• In+Out+Time: it uses the combination of all the features
from In, Out and Time.
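
The seven variants above are exactly the non-empty subsets of the three feature groups, which can be enumerated as in the following sketch. The names `FEATURE_GROUPS` and `feature_variants` are introduced here for illustration only.

```python
from itertools import combinations

FEATURE_GROUPS = ("In", "Out", "Time")


def feature_variants(groups=FEATURE_GROUPS):
    """All non-empty combinations of feature groups, smallest first."""
    variants = []
    for size in range(1, len(groups) + 1):
        for combo in combinations(groups, size):
            variants.append("+".join(combo))
    return variants


variants = feature_variants()
```

Each variant name then selects which feature blocks are concatenated before training the same model.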


Meanwhile, we also consider different multi-step ahead
prediction settings, i.e., next 7-day prediction and next
14-day prediction. The prediction results are shown in Table 2.
From Table 2, we observe that (1) when considering the three
types of features individually, the model trained on Out
features yields the best performance. Moreover, when
comparing In with Out and In+Time with Out+Time, we obtain a
significant performance improvement by using Out features.
These results indicate that dropout prediction for K-12
online scenarios is very different from MOOC-based dropout
prediction: the out-of-class activities and the quality of the
service play an extremely important role in the prediction
task; and (2) by utilizing all the sets of features, we
achieve the best results in both prediction tasks.
Table 2: Experimental results of different types of features and different lengths of multi-step ahead time
span in terms of AUC scores.

Multi-step ahead time span | In     | Out    | Time   | In+Time | Out+Time | In+Out | In+Out+Time
7 day                      | 0.6249 | 0.7764 | 0.6992 | 0.7145  | 0.7759   | 0.7768 | 0.7774
14 day                     | 0.6251 | 0.7386 | 0.6766 | 0.6932  | 0.7393   | 0.7420 | 0.7430

Figure 6: Models trained on data augmented by different
choices of weighting functions (linear, concave, convex)
with different multi-step ahead time spans in terms of
AUC scores.


5.6 Online Performance 

We deployed our at-risk student warning system in the real
production environment on a third-party platform from
February 2nd, 2019 to April 1st, 2019. To monitor the system
performance in practice, we conduct the next-day prediction
task, where the system predicts the dropout probability
for each ongoing student at 6 am every day. All the
students are ranked by their dropout probabilities, and the
top 30% of students with the highest probabilities are marked
as at-risk students. At the end of each day, we obtain the
real outcome of all the students who drop the class. We
compare the overlap between the predicted top at-risk
students (30% of total students) and the daily dropouts, and
find that more than 70% of dropout students are detected by
the system.
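
The daily overlap comparison can be sketched as follows. The function names `flag_at_risk` and `detected_fraction`, the student identifiers, and the probabilities are hypothetical and serve only to illustrate the ranking and the detection rate; they are not data from the deployment.

```python
def flag_at_risk(probabilities, top_fraction=0.3):
    """Return the ids of the students in the top `top_fraction` by dropout probability."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    n_flagged = max(1, round(top_fraction * len(ranked)))
    return set(ranked[:n_flagged])


def detected_fraction(flagged, dropouts):
    """Fraction of actual dropouts that were flagged as at-risk."""
    if not dropouts:
        raise ValueError("no dropouts to evaluate")
    return len(flagged & dropouts) / len(dropouts)


# Hypothetical daily predictions for 10 ongoing students.
probs = {"s0": 0.91, "s1": 0.15, "s2": 0.78, "s3": 0.05, "s4": 0.40,
         "s5": 0.66, "s6": 0.22, "s7": 0.09, "s8": 0.31, "s9": 0.12}
actual_dropouts = {"s0", "s4", "s5"}

flagged = flag_at_risk(probs)
detection_rate = detected_fraction(flagged, actual_dropouts)
```

In this toy example two of the three dropouts are among the flagged students, so the detection rate is 2/3.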


6. CONCLUSION 


In this paper, we present an effective at-risk student
identification framework for K-12 online classes. Compared to
existing dropout prediction research, our approach considers
and focuses on challenging factors such as multiple
modalities, length variability, time sensitivity, and class
imbalance when learning from real-world K-12 educational
data. Our offline experimental results show that our
approach outperforms other state-of-the-art prediction
approaches in terms of AUC scores. Furthermore, we deploy
our model into a production environment, where more than 70%
of dropout students are detected by the system. In the
future, we plan to explore the opportunity of using deep
neural networks to capture heterogeneous information in K-12
online scenarios to enhance the existing prediction pipeline.